New helper function to determine number of threads when using OpenMP #203
Conversation
Pull Request Overview
This PR introduces a new helper function, xss_get_num_threads, to determine the optimal number of threads based on the number of physical CPU cores, replacing hardcoded thread limits.
- Replaced hardcoded thread limits with xss_get_num_threads in multiple qsort and argsort functions.
- Added a new helper function in src/xss-common-includes.h to calculate the thread count.
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/xss-common-qsort.h | Removed hardcoded thread limit; now calls `xss_get_num_threads()`. |
| src/xss-common-keyvaluesort.hpp | Removed hardcoded thread limit; now calls `xss_get_num_threads()`. |
| src/xss-common-argsort.h | Removed hardcoded thread limit; now calls `xss_get_num_threads()`. |
| src/avx512-16bit-qsort.hpp | Removed hardcoded thread limit; now calls `xss_get_num_threads()`. |
| src/xss-common-includes.h | Introduced `xss_get_num_threads` to calculate thread count based on CPU cores. |
src/xss-common-includes.h
```cpp
{
    // Get the number of physical cores: works only when hyperthreading is
    // enabled on all cores
    int num_physical_cores = omp_get_num_procs() / 2;
```
The comment indicates that this logic assumes hyperthreading is enabled on all cores. Consider expanding the documentation to clarify that on systems without hyperthreading this calculation may underutilize available cores or adjust the calculation for more robust behavior.
This patch seems to cause a major regression on my SPR system for 1m and 10m sizes, despite using far more CPU. 100m benefits a bit, and interestingly has lower CPU time as well:
1m
```
Comparing simdsort/random_1m (from /home/sterrettm/xss/x86-simd-sort/.bench/main/builddir/benchexe) to simdsort/random_1m (from /home/sterrettm/xss/x86-simd-sort/.bench/raghu_omp/builddir/benchexe)
Benchmark                                              Time      CPU   Time Old   Time New    CPU Old    CPU New
---------------------------------------------------------------------------------------------------------------
[simdsort/random_1m vs. simdsort/random_1m]/uint64_t  +0.6633  +0.6638   1796738    2988555    1796346    2988695
[simdsort/random_1m vs. simdsort/random_1m]/int64_t   +0.6111  +0.6112   1813152    2921096    1812886    2920974
[simdsort/random_1m vs. simdsort/random_1m]/uint32_t  +1.4261  +1.4277    774474    1878973     774148    1879380
[simdsort/random_1m vs. simdsort/random_1m]/int32_t   +1.3888  +1.3907    789581    1886162     789146    1886582
[simdsort/random_1m vs. simdsort/random_1m]/uint16_t  +1.4335  +1.4344    659436    1604750     659213    1604815
[simdsort/random_1m vs. simdsort/random_1m]/int16_t   +1.4339  +1.4343    661424    1609820     661205    1609553
[simdsort/random_1m vs. simdsort/random_1m]/float     +1.4058  +1.4067    785036    1888669     784693    1888504
[simdsort/random_1m vs. simdsort/random_1m]/double    +0.7759  +0.7764   1398137    2482994    1397812    2483070
[simdsort/random_1m vs. simdsort/random_1m]/_Float16  +0.9243  +0.9252   1084081    2086116    1083642    2086215
OVERALL_GEOMEAN                                       +1.0886  +1.0894         0          0          0          0
```
10m
```
Comparing simdsort/random_10m (from /home/sterrettm/xss/x86-simd-sort/.bench/main/builddir/benchexe) to simdsort/random_10m (from /home/sterrettm/xss/x86-simd-sort/.bench/raghu_omp/builddir/benchexe)
Benchmark                                                Time      CPU   Time Old   Time New    CPU Old    CPU New
-----------------------------------------------------------------------------------------------------------------
[simdsort/random_10m vs. simdsort/random_10m]/uint64_t  -0.0393  -0.1954  25032500   24047678   25031503   20141074
[simdsort/random_10m vs. simdsort/random_10m]/int64_t   -0.0184  -0.1239  23815509   23376537   23802992   20853016
[simdsort/random_10m vs. simdsort/random_10m]/uint32_t  +0.5229  +0.5236   7872549   11989459    7869377   11989398
[simdsort/random_10m vs. simdsort/random_10m]/int32_t   +0.5457  +0.5460   7929956   12257078    7927239   12255340
[simdsort/random_10m vs. simdsort/random_10m]/uint16_t  +0.3530  +0.3530   5700344    7712651    5700039    7712366
[simdsort/random_10m vs. simdsort/random_10m]/int16_t   +0.3061  +0.3060   5710822    7459049    5710548    7457834
[simdsort/random_10m vs. simdsort/random_10m]/float     +0.5283  +0.5281   7984395   12202217    7984079   12200205
[simdsort/random_10m vs. simdsort/random_10m]/double    -0.0092  -0.1719  22430352   22224402   22429597   18573556
[simdsort/random_10m vs. simdsort/random_10m]/_Float16  +0.6462  +0.6463   4400445    7243948    4399974    7243662
OVERALL_GEOMEAN                                         +0.2883  +0.2228         0          0          0          0
```
100m
```
Comparing simdsort/random_100m (from /home/sterrettm/xss/x86-simd-sort/.bench/main/builddir/benchexe) to simdsort/random_100m (from /home/sterrettm/xss/x86-simd-sort/.bench/raghu_omp/builddir/benchexe)
Benchmark                                                  Time      CPU    Time Old    Time New     CPU Old     CPU New
-----------------------------------------------------------------------------------------------------------------------
[simdsort/random_100m vs. simdsort/random_100m]/uint64_t  -0.0767  -0.8039  580935369   536395893   578418465   113435125
[simdsort/random_100m vs. simdsort/random_100m]/int64_t   -0.0477  -0.6396  586234077   558282477   586229954   211306420
[simdsort/random_100m vs. simdsort/random_100m]/uint32_t  -0.0818  -0.4046  232056239   213083404   231701433   137960312
[simdsort/random_100m vs. simdsort/random_100m]/int32_t   -0.0364  -0.5359  232368921   223917915   232213279   107775511
[simdsort/random_100m vs. simdsort/random_100m]/uint16_t  -0.0501  -0.4083   80935109    76880782    80918052    47882478
[simdsort/random_100m vs. simdsort/random_100m]/int16_t   -0.0658  -0.3608   81683614    76308548    81642565    52188516
[simdsort/random_100m vs. simdsort/random_100m]/float     -0.0288  -0.3741  237212078   230381039   236491140   148012758
[simdsort/random_100m vs. simdsort/random_100m]/double    +0.0422  -0.4587  573784986   598021062   569874959   308481275
[simdsort/random_100m vs. simdsort/random_100m]/_Float16  -0.0881  -0.2259   81381343    74207809    81373671    62989080
OVERALL_GEOMEAN                                           -0.0488  -0.5004          0           0           0           0
```
I am very interested in why the CPU time delta is negative for some of these tests. I suspect that with so many threads being spawned, OpenMP may no longer have them spin-wait.
It seems like in those cases performance improves and CPU utilization drops. If we could force that behavior all the time, maybe setting the thread limit this high could work for smaller sizes as well?
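One low-effort way to probe the spin-wait hypothesis is through standard OpenMP environment variables, without touching the code. The sketch below is a suggestion, not part of the PR; `OMP_WAIT_POLICY` is defined by the OpenMP specification, while `GOMP_SPINCOUNT` (libgomp) and `KMP_BLOCKTIME` (Intel/LLVM runtime) are implementation-specific knobs, and the benchexe invocation is illustrative.

```shell
# Idle-thread behavior per the OpenMP spec:
#   passive -> waiting threads sleep/yield (less CPU time)
#   active  -> waiting threads spin (lower wakeup latency, more CPU time)
export OMP_WAIT_POLICY=passive

# Runtime-specific equivalents, forcing threads to sleep immediately:
export GOMP_SPINCOUNT=0   # libgomp: spin iterations before sleeping
export KMP_BLOCKTIME=0    # Intel/LLVM runtime: ms to spin before sleeping

# Re-run one of the comparisons above under the passive policy
# (path and filter are illustrative).
./builddir/benchexe --benchmark_filter='simdsort/random_1m'
```

If the 1m regressions shrink under the passive policy, that would support the theory that the extra CPU time (and contention) comes from spinning workers rather than from useful sorting work.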
Looks good to me! About 11% improvement for 100m, about 21% for 10m, seems like a good idea! I'll merge this once the CI passes.
This pull request refactors the logic for determining the number of threads used in parallel sorting functions by introducing a new helper function, `xss_get_num_threads`. This change ensures consistency and simplifies thread count management across multiple sorting implementations.